Probabilistic grammar and the Portuguese Stress Corpus
Abstract
This paper proposes a weight-based probabilistic approach to stress in Portuguese. Previous analyses have argued that weight-sensitivity in the language is categorical and restricted to word-final syllables. I show that weight effects in Portuguese are gradient and can be found in all three positions of the stress domain (a three-syllable window). I also compare two domains of weight computation, namely the syllable and the interval [5], and show that interval-based statistical models are more internally consistent.

A thorough phonological analysis of stress in a given language requires a comprehensive and detailed corpus. Unlike frequency corpora, such a corpus must contain quantitative and qualitative segmental information as well as stress marking, syllabification, syllable shapes and weight profiles, among other variables. The present study introduces the Portuguese Stress Corpus (PSC, author), which was developed to provide researchers with a large and reliable annotated lexicon of Portuguese. The corpus contains nearly all non-verbs in the language (n = 154,611) and is based on the wordlist of the Houaiss Dictionary [4], the most comprehensive dictionary of the Portuguese language.

Background

Weight, Syllables and Intervals: In languages where stress is weight-sensitive, syllables with greater weight are more likely to be prominent, i.e., to attract stress [2]. In interval theory [5], greater weight entails greater duration in a given interval, defined as the rhythmic unit that spans from a vowel up to (but not including) the following vowel, i.e., V-to-(V). Segments preceding the leftmost vowel are not included in any interval. Intervals (ι) have no a priori constituency and predict different rhythmic units from syllables (σ): the onset segments of a given syllable are computed as part of the preceding interval. For example, the two-syllable string CVCσ CCVCσ is equivalent to 〈C〉 VCCCι VCι in interval theory, where 〈C〉 is the initial consonant that belongs to no interval.
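The syllable-to-interval mapping described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build the corpus: the function name `to_intervals` is hypothetical, and the sketch assumes that each syllable is given as a CV skeleton with exactly one `V`, so that every `V` opens a new interval.

```python
def to_intervals(syllables):
    """Re-parse syllable skeletons (e.g. ["CVC", "CCVC"]) into intervals.

    Assumption: each syllable contains exactly one "V", so every "V"
    starts a new V-to-(V) interval. Returns a pair: the word-initial
    consonants that belong to no interval, and the list of intervals.
    """
    skeleton = "".join(syllables)
    first_v = skeleton.find("V")
    if first_v == -1:
        raise ValueError("skeleton contains no vowel")

    # Segments before the leftmost vowel are not part of any interval.
    initial_onset = skeleton[:first_v]

    intervals = []
    for symbol in skeleton[first_v:]:
        if symbol == "V":
            intervals.append("V")      # a vowel opens a new interval
        else:
            intervals[-1] += symbol    # consonants extend the current one
    return initial_onset, intervals


initial_onset, intervals = to_intervals(["CVC", "CCVC"])
print(initial_onset, intervals)  # C ['VCCC', 'VC']
```

Note how the onset of the second syllable (CC) ends up in the first interval, which is why onsets and codas are not distinguished in interval-based predictors.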
The onset effects found in the Portuguese Stress Corpus motivate intervals, as these effects are negatively correlated with stress.

Portuguese: Previous analyses ([1], [6], among others) have argued that weight effects on stress in Portuguese non-verbs are restricted to the word-final syllable (stress in verbs is not phonologically conditioned): stress is final if the word-final syllable is heavy; otherwise, stress falls on the penult syllable, regardless of its weight. Both final and penult stress patterns are (mostly) regular. Antepenult stress is irregular, and no pre-antepenult stress is allowed. Approximately 72% of the non-verbs in the language (N = 163,626) have regular, predictable stress. Researchers have employed different factors to account for stress regularities (e.g., foot binarity, foot type, metrical alignment) and irregularities (e.g., extrametricality, catalexis, theme vowel influence) in the language. Under previous analyses, irregular and regular words are by definition treated differently.

Methodology: In the present study, stress in the Portuguese Stress Corpus was modelled using binomial logistic regressions (glm() in R). Given that three stress positions are possible, two binomial models were required to predict stress. The first model predicts antepenult stress vs. penult or final stress ((1-a) and (2-a)). Because all words with antepenult stress are considered irregular, this model answers the following question: given a word, how likely is it to bear antepenult stress as opposed to penult or final stress? The second model predicts penult vs. final stress ((1-b) and (2-b)), i.e., the two regular positions in the language. Because syllables and intervals are compared, a total of four models are used: the syllable-based models in (1) and the interval-based models in (2). Syllables have three constituents (onset, nucleus and coda); intervals, on the other hand, are a single string of segments with no constituency. As a result, 9 predictors are used in (1-a) (three positions × three constituents), 6 in (1-b) (two positions × three constituents), 3 in (2-a), and 2 in (2-b).
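The model design in the Methodology paragraph can be illustrated with a short Python sketch. It is not the author's R code: the predictor names, the helper functions and the probabilities in the usage example are hypothetical, and the rule for combining the two binomial models into a three-way distribution over stress positions is an assumption of this sketch (the abstract does not say how, or whether, the two models are combined).

```python
import math

POSITIONS = ("antepenult", "penult", "final")
CONSTITUENTS = ("onset", "nucleus", "coda")


def syllable_predictors(positions):
    # One predictor per constituent per position (hypothetical names).
    return [f"{c}_{p}" for p in positions for c in CONSTITUENTS]


def interval_predictors(positions):
    # Intervals have no constituency: one predictor per position.
    return [f"interval_{p}" for p in positions]


def logistic(eta):
    # Inverse logit: maps a linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-eta))


def predict(intercept, coefs, features):
    # Probability from one fitted binomial model (coefficients supplied by the caller).
    eta = intercept + sum(coefs[name] * features[name] for name in coefs)
    return logistic(eta)


def stress_distribution(p_antepenult, p_penult_given_rest):
    # Assumed combination: model (a) gives P(antepenult); model (b) gives
    # P(penult | stress is not antepenult).
    p_rest = 1.0 - p_antepenult
    return {
        "antepenult": p_antepenult,
        "penult": p_rest * p_penult_given_rest,
        "final": p_rest * (1.0 - p_penult_given_rest),
    }


# Predictor counts for models (1-a), (1-b), (2-a), (2-b).
counts = [
    len(syllable_predictors(POSITIONS)),
    len(syllable_predictors(POSITIONS[1:])),
    len(interval_predictors(POSITIONS)),
    len(interval_predictors(POSITIONS[1:])),
]
print(counts)  # [9, 6, 3, 2]

# Placeholder probabilities, for illustration only (not results from the paper).
dist = stress_distribution(p_antepenult=0.1, p_penult_given_rest=0.75)
print(dist)
```

Splitting the three-way outcome into two nested binary questions keeps each model a plain binomial regression, at the cost of having to multiply probabilities if a single distribution over the three positions is needed.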
Publication date: 2015